Abstract:
In-memory computing is emerging as a promising paradigm in commodity servers to accelerate data-intensive processing by striving to keep the entire dataset in DRAM. To address the tremendous pressure on the main memory system, discrete memory modules can be networked together to form a memory pool, enabled by recent trends towards richer memory interfaces (e.g., Hybrid Memory Cubes, or HMCs). Such an inter-memory network provides a scalable fabric to expand memory capacity, but still suffers from long multi-hop latency, limited bandwidth, and high power consumption, problems that will only worsen as the gap between interconnect and transistor performance grows. Moreover, inside each memory module, an intra-memory network (NoC) is typically employed to connect the different memory partitions. Without careful design, back-pressure inside the memory modules can propagate to the inter-memory network and become a performance bottleneck. To address these problems, we propose co-optimization of the intra- and inter-memory networks. First, we re-organize the intra-memory network structure and provide a smart I/O interface that reuses the intra-memory NoC as the switching fabric for inter-memory communication, thus forming a unified memory network. Building on this architecture, we further optimize the inter-memory network for both higher performance and lower energy, including a distance-aware selective compression scheme that drastically reduces the communication burden, and a light-weight power-gating algorithm that turns off under-utilized links while guaranteeing a connected graph and deadlock-free routing. We develop an event-driven simulator to model the proposed architectures. Experimental results based on both synthetic traffic and real big-data workloads show that our unified memory network architecture achieves a 75.1% average memory access latency reduction and 22.1% total memory energy savings.
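To make the distance-aware selective compression idea concrete, here is a minimal sketch in Python. It assumes a 2D-mesh inter-memory network with dimension-order routing and a fixed hop-count threshold; the threshold value, the packet layout, and the use of zlib as the compressor are illustrative assumptions, since the abstract does not specify these details.

```python
# A minimal sketch of distance-aware selective compression, assuming a
# 2D-mesh inter-memory network and a hop-count threshold. The threshold,
# packet layout, and choice of zlib are illustrative assumptions.
import zlib
from dataclasses import dataclass

HOP_THRESHOLD = 3  # assumed: compress only packets traveling >= 3 hops

@dataclass
class Packet:
    src: tuple[int, int]   # (x, y) coordinate of the source cube
    dst: tuple[int, int]   # (x, y) coordinate of the destination cube
    payload: bytes
    compressed: bool = False

def hops(src, dst):
    # Manhattan distance = hop count under dimension-order mesh routing.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def maybe_compress(pkt: Packet) -> Packet:
    # Compression costs (de)compression latency at the endpoints but saves
    # bandwidth and link energy on every hop, so it only pays off when the
    # packet travels far enough to amortize the endpoint overhead.
    if hops(pkt.src, pkt.dst) >= HOP_THRESHOLD:
        pkt.payload = zlib.compress(pkt.payload)
        pkt.compressed = True
    return pkt
```

Compressing only long-haul packets keeps the endpoint (de)compression latency off the critical path of near-neighbor accesses; the link power-gating algorithm is a separate mechanism that the abstract constrains only to preserve connectivity and deadlock-free routing.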
Abstract:
High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance. Memory “cubes” with high per-package capacity (from 3D integration), along with high-speed point-to-point interconnects, provide a scalable memory system architecture with the potential to deliver both capacity and performance. Multiple such cubes connected together can form a “Memory Network” (MN), but the design space for such MNs is quite vast, including multiple topology types and multiple memory technologies per memory cube. In this work, we first analyze several MN topologies with different mixes of memory package technologies to understand the key tradeoffs and bottlenecks for such systems. We find that most of an MN's performance challenges arise from the interconnection network that binds the memory cubes together. In particular, the arbitration schemes used to route through MNs, the ratio of NVM to DRAM, and the specific topologies used have a dramatic impact on performance and energy results. Our initial analysis indicates that introducing non-volatile memory to the MN presents a unique tradeoff between memory array latency and network latency. We observe that placing NVM cubes in a specific order in the MN improves performance by reducing the network size/diameter, up to a certain NVM-to-DRAM ratio. Novel MN topologies and arbitration schemes also provide performance and energy gains by reducing the hop count of requests and responses in the MN. Based on our analyses, we introduce three techniques to address MN latency issues: (1) a distance-based arbitration scheme that reduces queuing latencies throughout the network, (2) a skip-list topology, derived from the classic data structure, that reduces network latency and improves link usage, and (3) the MetaCube, a denser memory cube that leverages advanced packaging technologies to reduce latency by shrinking the MN.
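As a concrete illustration of technique (1), the sketch below implements one plausible reading of distance-based arbitration: at each router output, packets that have already traveled more hops win the link, which evens out queuing delay between near and far cubes. The hops-traveled metric, the heap-based queue, and the FIFO tie-break are assumptions; the paper's arbiter may define distance differently (e.g., hops remaining).

```python
# A sketch of a distance-based arbiter for one router output port. Packets
# are ordered by hops already traveled (farther-traveled first), with a
# FIFO tie-break so equal-distance packets keep arrival order. These
# details are illustrative assumptions, not the paper's exact mechanism.
import heapq
import itertools

class DistanceArbiter:
    def __init__(self):
        self._queue = []
        self._arrival = itertools.count()  # monotonic tie-break counter

    def enqueue(self, packet, hops_traveled):
        # Negate hops so the max-distance packet sits at the heap top.
        heapq.heappush(self._queue, (-hops_traveled, next(self._arrival), packet))

    def grant(self):
        """Return the next packet allowed to traverse the link, or None."""
        if not self._queue:
            return None
        return heapq.heappop(self._queue)[2]

# Example: a request 5 hops from home wins over a 1-hop local request.
arb = DistanceArbiter()
arb.enqueue("local-read", hops_traveled=1)
arb.enqueue("remote-read", hops_traveled=5)
assert arb.grant() == "remote-read"
```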
Abstract:
Interposer-based packaging is becoming a widespread methodology for tightly integrating multiple heterogeneous dies into a single package, with the potential to improve manufacturing yield and build larger-than-reticle-sized systems. However, interposer integration also introduces possible communication bottlenecks and cost overheads that can outweigh these benefits. To avoid these drawbacks, the abundant interposer interconnect can be leveraged as a network-on-chip (NoC) interconnection fabric that provides high-bandwidth, low-latency communication between chiplets and memory stacks. This work investigates this new interposer design space of passive and active interposer technologies, NoC topologies, and clocking schemes to determine the cost-optimal interposer architectures for a range of performance requirements.
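The kind of exploration this abstract describes can be pictured as a sweep over design points with a cost model attached. The sketch below is purely illustrative: the option lists and all cost/performance numbers are made-up placeholders standing in for the paper's real models, which a study like this would derive from circuit- and cycle-level evaluation.

```python
# Illustrative design-space sweep: enumerate (interposer, topology, clocking)
# points, filter to those meeting a performance target, and return the
# cheapest. Every number below is a placeholder, not data from the paper.
from itertools import product

INTERPOSERS = ["passive", "active"]
TOPOLOGIES = ["mesh", "torus", "crossbar"]              # assumed candidate set
CLOCKING = ["fully-synchronous", "source-synchronous"]  # assumed candidate set

def evaluate(interposer, topology, clocking):
    """Return (relative_cost, relative_performance) for one design point."""
    cost = {"passive": 1.0, "active": 1.8}[interposer]
    perf = {"mesh": 1.0, "torus": 1.3, "crossbar": 1.6}[topology]
    if interposer == "active":
        perf *= 1.4   # active interposers can host routers in the interposer
    if clocking == "source-synchronous":
        cost *= 1.05  # extra clock-forwarding wires
        perf *= 1.1
    return cost, perf

def cost_optimal(perf_target):
    feasible = []
    for point in product(INTERPOSERS, TOPOLOGIES, CLOCKING):
        cost, perf = evaluate(*point)
        if perf >= perf_target:
            feasible.append((cost, point))
    return min(feasible, default=None)  # cheapest point meeting the target

print(cost_optimal(perf_target=1.5))
```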
Abstract:
Three-dimensional (3D) integration is considered a solution to overcome the capacity, bandwidth, and performance limitations of memories. However, due to thermal challenges and cost issues, industry has embraced 2.5D implementations for integrating die-stacked memories with large-scale designs, enabled by silicon interposer technology that integrates processors and multiple modules of 3D-stacked memory in the same package. Previous work has adopted Network-on-Chip (NoC) concepts for the communication fabric of 3D designs, but the design of a scalable processor-memory interconnect for 2.5D integration remains elusive. In this work, we first explore different network topologies for integrating CPUs and memories in a silicon interposer-based multi-core system and reveal that simple point-to-point connections cannot reach the full potential of the memory performance due to bandwidth limitations, especially as more and more memory modules are needed to support emerging applications with high memory capacity and bandwidth demands, such as in-memory computing. To overcome this scaling problem, we propose a memory network design that directly connects all the memory modules, utilizing the existing routing resources of silicon interposers in 2.5D designs. Observing the unique network traffic in our design, we present a design space exploration that evaluates network topologies and routing algorithms, taking process node and interposer technology decisions into account. We implement an event-driven simulator to evaluate our proposed memory network in silicon interposer (MemNiSI) design with synthetic traffic as well as real in-memory computing workloads. Our experimental results show that, compared to baseline designs, the MemNiSI topology reduces average packet latency by up to 15.3%, and the Choose Fastest Path (CFP) algorithm reduces it by up to a further 8.0%. Our scheme effectively exploits the potential of integrated stacked memory while providing better scalability and infrastructure for large-scale silicon interposer-based 2.5D designs.
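The abstract names Choose Fastest Path (CFP) without detailing it; a natural reading is an adaptive routing choice in which each hop picks, among the output ports lying on a shortest path to the destination, the one with the lowest estimated delay. The sketch below encodes that reading using queue occupancy as a congestion proxy; the delay estimate and port model are assumptions, not the paper's definition.

```python
# A sketch of a CFP-style port selection at one router, assuming the
# "fastest" path is estimated from per-port queue occupancy plus the
# static latency of the link behind each port. Both inputs are assumed
# congestion proxies; the paper's CFP metric may differ.
def choose_fastest_port(candidate_ports, queue_depth, link_latency):
    """candidate_ports: ports lying on a minimal path to the destination.
    queue_depth[p]:  flits currently buffered at port p.
    link_latency[p]: fixed traversal cycles of the link behind port p."""
    return min(candidate_ports, key=lambda p: queue_depth[p] + link_latency[p])

# Example: port "east" is minimal-distance but congested, so "north" wins.
ports = ["east", "north"]
print(choose_fastest_port(ports,
                          queue_depth={"east": 7, "north": 2},
                          link_latency={"east": 1, "north": 2}))  # -> north
```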
Abstract:
Domain-specific accelerators for graph analytics leverage a large on-chip memory to tackle intensive random memory accesses, offering higher performance and energy efficiency than conventional architectures. However, limited by inefficient usage of on-chip memory, current accelerators suffer from energy and performance bottlenecks due to the large number of off-chip memory accesses. In this work, we introduce an online preprocessing step for the vertex-centric programming model, based on our observation of imbalanced memory bandwidth utilization between the two execution phases. Our scheme improves energy efficiency and performance by significantly reducing off-chip accesses in two ways. First, we sequence random off-chip memory accesses to balance memory bandwidth demands and improve the utilization of on-chip memory. Second, we prune active leaf vertices to avoid redundant memory accesses. We evaluate our method on a state-of-the-art graph analytics accelerator and achieve a 1.6× speedup while reducing energy consumption by 42% on average.
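To illustrate the second idea, here is a minimal sketch of active-leaf pruning inside a push-style vertex-centric scatter loop. It assumes "leaf" means a vertex with at most one outgoing edge, whose scatter can only revisit the neighbor that activated it, and it assumes a CSR edge layout; both are illustrative readings of the abstract, not the accelerator's actual dataflow.

```python
# Sketch of active-leaf pruning in a push-style scatter phase over a CSR
# graph. Leaves (out-degree <= 1) are dropped from the frontier before
# their edge lists are fetched, saving the off-chip accesses that could
# not have activated any new vertex. The leaf definition and CSR layout
# are assumptions made for illustration.
def prune_leaves(frontier, out_degree):
    return [v for v in frontier if out_degree[v] > 1]

def scatter(frontier, row_ptr, col_idx, out_degree, push_update):
    """push_update(src, dst) applies src's update to dst and returns True
    if dst's value changed (i.e., dst becomes active next iteration)."""
    next_frontier = set()
    for v in prune_leaves(frontier, out_degree):
        for e in range(row_ptr[v], row_ptr[v + 1]):  # v's CSR edge list
            u = col_idx[e]
            if push_update(v, u):
                next_frontier.add(u)
    return next_frontier
```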